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Abstract 

This paper studies the effect of hnguistic constraints on the large scale organization of language. It 
describes the properties of linguistic networks built using texts of written language with the words ran- 
domized. These properties are compared to those obtained for a network built over the text in natural 
order. It is observed that the "random" networks too exhibit small-world and scale-free characteristics. 
They also show a high degree of clustering. This is indeed a surprising result - one that has not been 
addressed adequately in the literature. We hypothesize that many of the network statistics reported here 
studied are in fact functions of the distribution of the underlying data from which the network is built 
and may not be indicative of the nature of the concerned network. 



1 Introduction 



Human language is a good example of a naturally occurring self-organizing complex system. Language, 
when modeled as a network, has been shown to exhibit small-world and scale-free properties 15 . A 



network is said to be scale-free if its degree distribution follows a a power-law JlJ. Moreover, it has also 
been suggested that the network evolves over time by following the rule of preferential attachment. That 
is, a new word being introduced into the system tends to associate with a pre-existing word that is highly 
frequent However, probably the first ever mathematical law that hinted at the complex nature of 



language was given by Zipf 22 . Zipf's law gives us the following relationship between the frequency 
of a word, F{r), in a language and its rank r, when words are ranked in order of their frequencies of 
occurrence (most frequent word assigned rank 1): 



Fir) 



C 
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Zipf observed this "law" for English (with C ~ 0.1 and awl) 17 , 16 but it has been shown to be 
obeyed by other natural languages [2]. It seems surprising at first that Zipf's law is also obeyed by 
random texts 17 , 16 . Random texts may be produced artificially, composed of symbols assigned prior 



probabilities of occurrence, the latter may be unequal. |11 demonstrates, through numerical simulation, 
that random texts follow Zipf's law because of the choice of rank as the independent variable in that 
relationship. Therefore, this paper concluded that Zipf's law is not an intrinsic property of natural 
language of any major consequence. Recent work \8j has shown that this is not the case. 

In 13 , the authors state that a linguistic network formed from a random permutation of words 



also exhibits a power-law. They attribute this to the nature of the linguistic network created which is 
essentially a co-occurrence network. A co-occurrence network, for our purposes, is a directed graph of 
words which are linked if they are adjacent to each other in a sentence. The authors reason that in 
such a network, the degree of a word is proportional to its frequency within the text from which the 
network is constructed. Thus, they say, permuting a text has no impact on the degree distribution of 
a co-occurrence network. This is not true unless duplicate edges can be added to the graph (which is 
clearly not the case). An illustration is the dummy sentence A B C D E B C (degrees are A=l, B=3, 
C=2, D=2, E=2) and one of its random permutations D B E C B A C (degrees are A=2, B=4, C=3, 
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D=l, E=2). The identities of the words that co-occur with a word, and whether the word itself occurs 
at the end or the beginning of a sentence clearly affect its degrees in the graph. 

In this work we analyze in detail the topology of the language networks at different levels of linguistic 
constraints. We find that networks based on randomized text exhibit small-world and scale-free charac- 
teristics. We also show that many of the network statistics we study are functions of the distribution of 
the underlying data from which the network is built. This study tries to shed light on several aspects 
of human language. Language is a complex structure where words act as simple elements that come 
together to form language. We try to understand why is it rather easy to link such simple elements to 
form sentences, novels, books,.. etc. We are also interested in studying the structural properties of word 
networks that would provide humans with easy and fast word production. We study such word networks 
at different levels of linguistics constraints to better understand the role of such constraints in terms of 
making mental navigation through words easy. In this study, we conduct experiments with four different 
languages (English, French, Spanish, and Chinese). This allows us to find out the features of word net- 
works that are common to all languages. The experiments we will describe throughout this paper were 
able to find some answers that will ultimately lead to the answers to those harder questions. 

We study the properties of linguistic networks at different levels of linguistic constraints. We study 
the networks formed by randomizing the underlying text, and how they compare to the properties of 
networks formed from plain English Broadly, two types of networks are built in this paper - frequent 
bigrams (should we call these collocations?) and co-occurrence networks. A frequent bigram can be 
defined as a sequence of two or more words that is statistically idiosyncratic. That is, in a certain 



context, the constituent words of a frequent bigram co-occur with a significant frequency. In 12 , a 
collocation (frequent bigram) is defined as a conventional way of saying something. Library card, phone 
booth and machine translation are examples of collocations. A frequent bigrams network, simply, is a 
graph of words as vertices such that there is an edge between two words provided they form a frequent 
bigram. Frequent bigrams networks are built from unscrambled text for the purpose of general study. 
Here, the method presented in (f? is improved upon by employing a Fisher's exact test to extract frequent 
bigrams 19 . A sample from the frequent bigrams graph constructed in this study is shown in figure [T] 



The contributions of this paper include: (1) we used uniformly distributed permutation algorithms 
for scrambling the corpus and added the Fisher exact test for detecting significantly correlated word 
pairs, (2) we obtained good results on networks induced from random corpus. Randomization of the 
corpus reduces linguistic constraints. Linguistic constraints reduce the small- worldness of the network!, 
(3) we used network randomization algorithms that preserve the degree distribution, and (4) we used 
lemmatized words and considered multiple languages. 

The paper is organized as follows: in section [2] we review some related work and put our work in 
context with respect to related work. Section [3] describes our methodology and how the different networks 
are created. Finally in section [4j we analyze the different properties of all networks. 

2 Related Work 

Network models of language have been studied by many researchers in the past [6j[2l] . 

One of the earliest such studies [6] constructed a "restricted" (frequent bigrams) network and an 
"unrestricted" (co-occurrence) network from a subset of the British National Corpus fsl . These networks 
were undirected. In that work, a bigram, WiWi+i, is treated as a frequent bigram if the probability of its 
occurrence, Pis+i, is found to be greater than pi ■ Pi+i, under the assumption of the constituent words 



being independent. This is a simplistic and unreliable filter for word dependence 12 . 19 demonstrates 
the suitability of Fisher's (right-sided) Exact test for this purpose. In the unrestricted network, words 
are linked if they co-occur in at least one sentence within a window of two words. Both these networks 
were shown to possess small-world characteristics with the average minimum distance between vertices 
« 3. They were also shown to be scale-free, with degree distributions following composite power-laws 
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Figure 1: Two hops from 'the' to 'president' in the frequent bigrams network 



with exponents 71 — —1.50 and 72 = —2.70 respectively. The latter exponent is approximately equal to 
7 — —3, obtained for a random graph constructed by employing the Barabasi-Albert (BA) model with 
preferential attachment. Also studied was a "kernel" network of the 5000 most connected words extracted 
from the restricted network. This network too was shown to follow a power-law degree distribution with 
7 = — 3. The authors go on to suggest that power-law coefficients obtained are indicative of language 
having evolved following the law of preferential attachment. They also suggest that the kernel network 
constructed here is representative of human mental lexicons; these lexicons possess the small- world feature 
which facilitates quick navigation from one word to another. 

Beyond universal regularities such as Zipf 's law, [20] recently examined burstiness, topicality, semantic 
similarity distribution and their interrelation and modeled them with two mechanisms, namely frequency 
ranking with dynamic reordering and memory across documents. Besides, large web datasets were used 
to validate the model. This paper and several other papers focused on modeling human written text with 
specific mechanisms. 

Network theory has been used in several studies about the structure of syntactic dependency networks. 
In [7], the author overview-ed the past studies on linguistic networks and discussed the possibilities and 



advantages of network analysis of syntactic dependency networks. In 10 , network properties such as 
small world structure, heterogeneity, hierarchical organization, betweenness centrality and assortativeness 
etc were examined for the syntactic networks from Czech, Romanian and German corpora. Seven corpora 
were examined by similar complex network analysis methods in ^9^ . Several common patterns of syntactic 
dependency networks were found. These patterns include high clustering coefficient of each vertex, the 
presence of a hierarchical network organization, disassortative mixing of vertices. In [5], spectral methods 
were introduced to cluster the words of the same class in a syntactic dependency network. 

In [14| , the authors examined the structural properties of two weighted networks, a linguistic network 
and a scientific collaboration network. The linguistic network in this paper is simply a co-occurrence 
network instead of syntactic dependency network. The weight of edges between vertices were considered 
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Figure 2: Degree distributions for ENG-COLL, ENG, ENG-RANSEN and ENG-RANDOC 



in the paper. The networks built from shuffled text were used as the null hypothesis to compare with 
the real network in order to find the characteristic of the real ones. Through the analysis of differences 
between the real network and a shuffled network, they proved that the scale free degree distribution are 
induced by Zipf's law. 

In [4j, the authors model the evolution of a network of language based upon preferential attachment. 
That is, a new word is connected to a word in the network with probability proportional to the latter's 
degree. The model that they develop, almost astonishingly, agrees very well with the empirical results 
obtained in [6| . The degree distribution of the theoretical network follows a composite power-law with 
exponents exactly equal to those obtained by This work further validates the scale-free nature of 
human language. 

A stochastic model of language was created on the basis of combined local and global preferential 



attachment (PA) in 13 . A new word attaches to the nearest neighbor with highest degree in the local 



PA scheme, whereas it attaches to the word with the highest degree in the entire graph in global PA. They 
find that their model produces a network with scale-free properties with various statistics in agreement 
with empirical evidence. They argue that plain PA schemes don't necessarily take into account the 
hierarchical nature of networks as is hinted at by Zipf's law and hence their combined model, which 
follows a mixed local-global growth is more suitable for the desired purpose. 
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Figure 3: Cumulative degree distribution for ENG 

3 Methodology 

In this article we consider the two aspects that have not been previously studied in depth in word 
cooccurrence networks: (1) the effect of the strength of links and (2) the word forms that vertices stand 
for. 

We consider four levels of cooccurrence constraints (from less to more constraints): 

1. RANDOC 

Words were permuted randomly within each document. Two vertices are linked if the corresponding 
words are adjacent in the permuted text, sentence boundaries were respected. That is, the length 
of the original sentences was kept constant 

2. HANSEN 

Words were permuted randomly within each sentence. Two vertices are linked if the corresponding 
words are adjacent in the permuted text. 

3. PLAIN 

Two vertices are linked if the corresponding words are adjacent in the original text. 

4. COLL 

Two vertices are linked if the corresponding words form a highly associative bigram. Fisher's 
Exact Test was used to extract highly associative bigrams (significance value < 0.01) from the 
unscrambled English text. Fisher's Exact Test is used because it is considered a more suitable test 
for determining word associativeness (19) We assume that the frequency associated with a bigram 
< word! >< 'word2 > is stored in a 2x2 contingency table: 

word2 ^word2 
wordl nil nl2 

-.wordl n21 n22 



where nil is the number of times < wordl >< word2 > occur together, nl2 is the number of times 
< wordl > occurs with some word other than word2, and so on. Fisher's exact test is calculated 
by fixing the marginal totals and computing the hypergeometric probabilities for all the possible 
contingency tables. 



We also consider two kinds of vertices 
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1. RAW 

Words are used in their raw forms. 

2. LEMM 

A lemmatized form of words is used. 

For each corpus, a different cooccurrence network is buih for each level of constraints and type of 
vertex. This results in 8 different networks for each language. We assume that that our graphs are 
undirected and that loops are allowed 

We define the structure of an undirected graph of n vertices through a binary adjacency matrix 
A — {aij}, where aij = 1 if the vertex i and the vertex j are linked and a^j = otherwise. Notice that 
the matrix is symmetric (a^ = a^fc) and an = 1 is possible. We define fc^, the degree of the i-th vertex, as 



The English text we used to construct all the networks comes from a random subset of the British 
National Corpus. The subset had 7.5M words. We used the Fisher- Yates shuffle algorithm for creating 
random permutations within sentences and documents. This algorithm produces each possible permuta- 
tion with equal probability. 

We report several general network statistics including average degree, diameter, average shortest path, 
global and local clustering coefficient. The diameters and average shortest paths were calculated using 
the maximum connected components while the other statistics were calculated used the whole network. 

In addition to English, we consider several other languages. We used a subset of the Spanish, 
French, and Chinese Gigaword corpora. We built four different networks for each language (RAN- 
DOC, RANSENT, PLAIN, and COLL). More statistics regarding the datasets used for constructing the 
networks is shown in Table |6l 



n 




(2) 



Table 1: Top ten frequent words in English corpora 
Word Frequency 



the 503322 

of 246620 

to 213108 

and 208294 

a 182067 

in 157755 

is 79274 

that 79265 

was 73143 

for 71614 



4 Results 



4.1 Degree Distribution 



The degree distributions for all four networks ENG-COLL, ENG, ENG-RANSEN and ENG-RANDOC 
are plotted in figure [2j It is worth mentioning that all figures are on a base 10 log-log scale. All 
four distributions show characteristics of a power-law. The tails of the corresponding cumulative degree 
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Figure 4: Line fitted to the tail of the cumulative degree distribution for ENG-COLL, ENG, ENG- 
RANSEN, ENG-RANDOC 
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Figure 5: Degree distribution for SPA-COLL, SPA, SPA-RANSEN, SPA-RANDOC, FRE-COLL, FRE, 
FRE-RANSEN, FRE-RANDOC, CHI-COLL, CHI, CHI-RANSEN, CHI-RANDOC 




Figure 6: Cumulative degree distribution for SPA-COLL, SPA, SPA-RANSEN, SPA-RANDOC, FRE- 
COLL, FRE, FRE-RANSEN, FRE-RANDOC, CHI-COLL, CHI, CHI-RANSEN, CHLRANDOC 
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Figure 8: Cumulative degree distribution for stemmed ENG-COLL, ENG, ENG-RANSEN and ENG- 
RANDOC 
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Table 2: Top ten most connected words in ENG-COLL 



Word 


Degree 


the 


2505 


of 


1492 


to 


1408 


a 


1095 


in 


913 


and 


795 


is 


522 


was 


483 


that 


458 


it 


436 



Table 3: Top ten most connected words in ENG 



Word 


Degree 


the 


60808 


and 


60514 


of 


41141 


in 


33967 


to 


30897 


a 


30810 


for 


21178 


with 


19912 


is 


19303 


as 


19158 



distributions are plotted in figure [4] and are fitted by lines. The power-law coefficient is fitted with the 
maximum likelihood method as recommended in [18| . The coefficients thus obtained from the plots are 
lENG-coLL = 2.0, -lENG = 2.3, lENG-RANSEN = 2.1 and lENG-RANDOC = 2.0. These are aU very 
similar values and are also close to the values obtained for co-occurrence and frequent bigrams (7 — 2.7 
each) networks in 

For Spanish, French and Chinese experiment, the degree distributions are plotted in Figure. [5] 
The corresponding cumulative degree distributions are plotted in Figure. [6] All distributions show 
characteristics of a power-law. The coefficients for the networks are "/spa-coll — 2.1, "/fre-coll = 

2.1, IcHi-coLL — 2.5 7spA = 2.3, ^fre = 2.2, ^chi — 2.8, ^spa-ransen — 2.2, jfre-ransen — 

2.2, JCHI-RANSEN = 2.3, JSPA-RANDOC — 2.1, JFRE-RANDOC = 2.1, JCHI-RANDOC = 2.7. These 

are similar to the English dataset (jeng = 2.3, Jransen = 2.1). 

The distributions of the networks based on the lemmatized form of the words are pretty similar to 
the original ones. The degree distributions for stemmed ENG-COLL, ENG and ENG-RANSEN are 
plotted in Figure [7] and Figure [8) The coefficients are Jstemeng-coll = 2.0, Jstemeng = 2.2, 
ISTEMENG-RANSEN = 2.2 and JSTEMENG-RANDOC = 2.1. The degree distributions for stemmed 
FRE-COLL, ERE, FRE-RANSEN and FRE-RANDOC are plotted in Figure[9]and Figure[l0] The coeffi- 
cients are jstemfre-coll = 2.3, jstemfre — 2.1, ^stemfre-ransen = 2.2 and jstemfre-randoc 
2.1. The degree distributions for stemmed SPA-COLL, SPA, SPA-RANSEN and SPA-RANDOC are 



plotted in Figure 11 and Figure 12 The coefficients are Jstemspa-coll = 2.4, Jstemspa = 2.3, 



ISTEMSPA-RANSEN = 2.1 and JSTEMSPA-RANDOC = 2.3. 
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Table 4: Top ten most connected words in ENG-RANSEN 



Word 


Degree 


the 


71498 


of 


50635 


and 


48239 


to 


42940 


a 


42750 


in 


39543 


is 


25264 


for 


24556 


that 


23786 


was 


23703 



Table 5: Top ten most connected words in ENG-RANDOC 



Word 


Degree 


the 


71708 


of 


49673 


to 


45677 


and 


45665 


a 


42779 


in 


39332 


is 


26648 


that 


26143 


for 


25133 


was 


24696 



4.2 Link density 

As expected, the more cooccurrence constraints, the lower the density of links. In particular, we find 
a perfect negative rank correlation between the mean degree of vertices and the level of constraints for 
all the networks of the same language and the same kind of vertex. More formally, wc have that the 
mean degree k obeys kcOLL < kpLAiN < kaANSEN < knANDOC- Knowing that there are not ties 
between values of k for the same language and kind of vertex, the probability that the expected ordering 
is produced by chance is 1/4! » 0.041. Thus, the perfect correlation between the level of constraints is 
statistically significant at a significance level of 0.05. 

Since the number of vertices of the network is the same for all networks of the same language (corpus) 
and the same kind of vertex, this perfect negative correlation is also equivalent to a perfect negative 
correlation between link density and level of constraints. The link density (5 of a network where loops are 
allowed is defined as the proportion of linkable pairs of vertices that are linked. For the particular case 
of an undirected network where loops are allowed, we have 



5 



^ n n 
\ 2 I i=l j=i 



l=i J- 

k 



n + r 



(4) 



where n is the number of vertices (if loops where not allowed this would he S = k/{n — 1). The link 
density of the network of raw words is smaller than that of the network of stemmed words with the same 
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Language Corpus length (in kilo words) Kind of vertex Number of different vertices 



ENG 


7500 


RAW 


118889 


ENG 


7500 


STEMMED 


71521 


PRE 


1700 


RAW 


49932 


PRE 


1700 


STEMMED 


33589 


SPA 


1300 


RAW 


61260 


SPA 


1300 


STEMMED 


29216 


CHI 


170 




4573 



Table 6: Corpus length is the length of the dataset used for constructing the network which may not 
coincide with the length of the whole corpus (the parenthesis show the proportion of the real corpus 
used). 
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Figure 9: Degree distribution for stemmed FRE-COLL, PRE, PRE-RANSEN and FRE-RANDOC 
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Figure 10: Cumulative degree distribution for stemmed FRE-COLL FRE, FRE-RANSEN and FRE- 
RANDOC 
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Figure 11: Degree distribution for stemmed SPA-COLL, SPA, SPA-RANSEN and SPA-RANDOC 

level of constraints, i.e. k < ksTEMMED or equivalently, 6 < 5stemmed- 
4.3 Small-worldness 

We define d as the shortest distance between a pair of nodes. The average shortest path is the average 
of all such distances. The diameter of a network is the number of links in the shortest path between 
the furthest pair of nodes. The diameter and the average shortest path for all networks are shown in 
Table [8] and Table ?? respectively. The diameters for all four networks, as shown in Table [8) are small. 
These networks are small worlds. What is surprising is that the diameters for RANSEN and RANDOC 
are smaller than that for ENG. This may be attributed to the randomization produces which forms links 
between words which are otherwise not connected in ENG. This results in a network that is even faster 
to navigate. 

The mean shortest vertex-vertex distance, d obeys, 

1. d> dsTEAiMED for all languages and constrain levels. 

2. dcoLL < duANDOC < dfiANSEN < dpLAiN in all languages except Chinese, where we have 

dRANDOC < dcOLL < dfiANSEN < dpLAIN- 

The diameter, i.e. the longest shortest path, d™"^ obeys 

1. d™"^ > d^^^j^jj^j^^ for all languages and constraint levels (we would have d™"^ > d^^^^f^^j^ if 
COLL was excluded). 



^- "-COLL ^ "'RANDOC ^ ^RANSEN ^ "-PLAIN 




Figure 12: Cumulative degree distribution for stemmed SPA-COLL, SPA, SPA-RANSEN and SPA- 
RANDOC 



17 



Table 7: Mean Degree (k) for all Networks 



Statistic 



COLL PLAIN HANSEN RANDOC 



ENG-RAW 
ENG-LEMM 

FRE-RAW 
FRE-LEMM 

SPA-RAW 
SPA-LEMM 



0.32 37.18 52.97 55.50 

0.53 45.54 70.88 74.85 

0.13 15.70 29.65 33.36 

0.24 20.24 38.70 44.07 

0.06 12.27 22.96 25.42 

0.17 19.67 33.24 36.59 

8.41 171.07 114.43 436.01 



CHI 



Table 8 
Statistic 



Diameters (o?'"" 
COLL PLAIN 



) for all networks 



RANSEN RANDOC 



ENG-RAW 
ENG-LEMM 

FRE-RAW 
FRE-LEMM 

SPA-RAW 
SPA-LEMM 



4 26 

4 22 

4 16 

4 14 

4 19 

4 13 

4 10 



17 7 

14 7 

14 8 

12 8 

13 10 
13 8 
6 5 



CHI 



4.4 Clustering 

We examined two clustering coefficients for all networks The local clustering is defined as the mean of 
the vertex clustering, 



where Ci is the clustering of the i-th vertex and n is the number of vertices of the network. Ci is the 
proportion of linkable pairs of adjacent vertices to vertex i that are linked . 

The global clustering coefhcient is based on triplets of nodes. A triplet could be either open or closed. 
In an open triplet, the three nodes are connected by two edges. Where as in a closed triplet, they 
are connected with three edges. The global clustering coefRcient is computing by calculating the ratio 
between the number of closed triplets and the total number of triplets. Both local and global clustering 
coefficients ignore the loops and the isolated vertices. View the networks as undirected. The local and 
global clustering coefficients for all networks are shown in Tables [TT] and [TO] respectively. 

4.5 Degree correlations 

Assortative mixing is a bias in favor of connections between network nodes with similar characteristics. 
In other words the higher the degree of a vertex, the higher the degree of its neighbors. On the other 
hand, disassortative mixing refers to the phenomenon where the higher the degree of a vertex the lower 
the degree of its neighbors. 

Figures [l3|14|15|16|17|18| and[T9]show the relation between the degree of a vertex k and the normalized 
mean degree of the nearest neighbors of vertices of degree k. We notice that the normalized mean degree 
of the nearest neighbors of vertices of degree k, shrinks as k grows for all networks, (i.e. vertices with 
large degree tend to connect with vertices with low degree). The figures show that the networks exhibit 
disassortative mixing pattern for all languages and levels of constraint and regardless of the kind of 




(5) 
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Table 9: Average Shortest Paths {d) for all networks. Standard deviation is shown in parenthesis 



Statistic 



COLL 



PLAIN 



RANSEN RANDOC 



ENG-RAW 
ENG-LEMM 

FRE-RAW 
FRE-LEMM 

SPA-RAW 
SPA-LEMM 
CHI 



2.48 (0.54) 

2.48 (0.54) 

2.49 (0.65) 
2.45 (0.58) 

2.50 (0.60) 
2.45 (0.59) 
2.30 (0.55) 



3.08 (0.72) 
3.02 (0.70) 
3.41 (0.83) 
3.30 (0.83) 
3.69 (0.98) 
3.19 (0.79) 
2.62 (0.71) 



2.95 (0.63) 
2.89 (0.60) 
3.07 (0.69) 
2.99 (0.69) 
3.06 (0.68) 

2.96 (0.68) 
2.32 (0.52) 



2.90 (0.56) 
2.83 (0.54) 
2.96 (0.58) 
2.86 (0.57) 
3.00 (0.62) 
2.88 (0.59) 
2.26 (0.54) 



Table 10: Global Clustering Coefficients for all networks 


Statistic 


COLL 


PLAIN 


RANSEN 


RANDOC 


ENG 


0.0660 


0.0236 


0.0455 


0.0492 


STEMMED ENG 


0.0660 


0.0548 


0.1018 


0.1086 


ERE 


0.1465 


0.0136 


0.0341 


0.0429 


STEMMED ERE 


0.1062 


0.0273 


0.0628 


0.0764 


SPA 


0.0730 


0.0068 


0.0193 


0.0250 


STEMMED SPA 


0.0500 


0.0272 


0.0521 


0.0600 


CHI 


0.2953 


0.3509 


0.2472 


0.5043 



the vertices (raw or lemmatized words). The relationship between fc„„ and fc is a consequence of word 
frequencies. The network randomization algorithms destroys the pattern for all complexity levels. 



4.6 The relationship between vertex frequency and degree 

Please, put a table showing the 5 or 10 most frequent words in the English corpus and the 5 or 10 most 
connected words in each of the networks: COLL, PLAIN, RANSEN and RANDOC. 

To examine the relationship between vertex frequency and degree, we use linear regression after 
excluding some outlier with frequency > 10"". The result of linear regression for the real network shows 
a strong correlation between degree and frequency. The slope is 2.86, the p value is less than 2 * 10^^^, 
is 0.89. For the random network, we also observe a high correlation between degree and frequency. 
The slope is 2.11, the p value of the correlation between two variables is less than 2 * 10"^^, is 0.89. 
In addition, we averaged the degree of words of a certain frequency and then make a plot for all the 

^ ^ Fig. ^ ^ ^ ^ ^ 



frequency range for all networks (Fig. 20 Fig. 21 



22 Fig. 23 Fig. 24 Fig. 25 Fig. 26) 



We notice that the higher the frequency of a word, the higher its degree. We also notice that the 
hubs of the networks are high frequency words. Lists of the top ten frequent words and the top ten most 
connected words in all English networks are shown in Tables [T] [2j [3j |4] and [5j 



5 Conclusion 

We studied the topological properties of linguistics networks at different levels of linguistic constraints. 
We found out that the networks produced from randomized data exihibit small worlds and scale-free 
characteristics. One possible explanation would be that degree distributions are functions of the word 
frequencies. However human language is a very complex "system" and there is no simple way to explain 
this observation. We also find out hat the degree distribution of a co-occurrence graph does change under 
randomization Further, network statistics such as diameter and clustering coefficient too seem to depend 
on the degree distributions of the underlying network. 
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Figure 13: A;„„(/c) for ENG-COLL, ENG, ENG-RANSEN, ENG-RANDOC and corresponding random 
networks 
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Figure 14: for stemmed ENG-COLL, ENG, ENG-RANSEN, ENG-RANDOC and corresponding 

random networks 
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Figure 15: fc„„(fc) for SPA-COLL, SPA, SPA-RANSEN, SPA-RANDOC and corresponding random net- 
works 
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Figure 16: A:„„(fc) for stemmed SPA-COLL, SPA, SPA-RANSEN, SPA-RANDOC and corresponding 
random networks 
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Figure 17: for FRE-COLL, FRE, FRE-RANSEN, FRE-RANDOC and corresponding random 

networks 
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Figure 18: A;„„(A;) for stemmed FRE-COLL, FRE, FRE-RANSEN, FRE-RANDOC and corresponding 
random networks 
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Figure 19: A;„„(A;) for CHI-COLL, CHI, CHI-RANSEN, CHI-RANDOC and corresponding random net- 
works 




Figure 21: Averaged Degree by word frequency for stemmed ENG 
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frequency frequency 

Figure 22: Averaged Degree by word frequency for SPA 




Figure 23: Averaged Degree by word frequency for stemmed SPA 
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Figure 25: Averaged Degree by word frequency for stemmed FRE 
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Table 11: Local Clustering CoefEcients for all networks 



Statistic 


COLL 


PLAIN 


RANSEN 


RANDOC 


ENG 


0.9605 


0.5520 


0.5910 


0.6345 


STEMMED ENG 


0.9605 


0.5991 


0.6582 


0.7143 


FRE 


0.7885 


0.4550 


0.5467 


0.5612 


STEMMED FRE 


0.9235 


0.4758 


0.5745 


0.6172 


SPA 


0.7945 


0.4017 


0.5315 


0.5094 


STEMMED SPA 


0.8723 


0.5044 


0.5812 


0.5844 


CHI 


0.7688 


0.5331 


0.6101 


0.8104 
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